【Reading】R for Data Science Part 1 C1-C16

Book
Author

Tony Duan

Published

November 10, 2022

https://r4ds.had.co.nz/

R for Data Science by Hadley Wickham & Garrett Grolemund

1-2 Introduction

library(tidyverse)
#tidyverse_update()

library(nycflights13)
library(gapminder)
library(Lahman)
1 + 2
[1] 3

3 Data visualisation

ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more, and do it faster, by learning one system and applying it in many places.

Creating a ggplot

To plot the mpg data frame, run this code to put displ (engine displacement) on the x-axis and hwy (highway fuel efficiency) on the y-axis:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
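
The call above follows a general pattern: a data frame, a geom function, and a set of aesthetic mappings. As a minimal sketch of the same plot, the mapping can also be supplied once to ggplot() so that any later layers inherit it. The object name p is only an illustrative choice.

```r
library(ggplot2)

# Same scatterplot as above, but the mapping is given to ggplot()
# so that every layer added afterwards would inherit it.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Printing the object draws the plot.
print(p)
```

Storing the plot in a variable makes it easy to add further layers later without retyping the data and mapping.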

Map the color of the points to the class variable to reveal the class of each car:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
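
Mapping an aesthetic inside aes() is different from setting it by hand. The sketch below contrasts the two: in the first plot color varies with class, in the second every point is drawn in one fixed color. The names p_mapped and p_fixed and the color "blue" are illustrative choices only.

```r
library(ggplot2)

# Mapped: color is inside aes(), so it varies with the class variable
# and ggplot2 adds a legend.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Set manually: color is outside aes(), so all points share one color
# and no legend is drawn.
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

print(p_mapped)
print(p_fixed)
```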

4 Workflow: basics

Assignment statements

a <- 1
a
[1] 1
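
As a small sketch of how assignment behaves in practice, the example below stores the result of an expression and then uses surrounding parentheses to assign and print in one step. The names x and y are arbitrary.

```r
# Assign the result of an expression to a name with <-.
x <- 3 * 4

# Typing the name on its own prints the stored value.
x

# Wrapping an assignment in parentheses assigns and prints at once.
(y <- x / 2)
```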

Calling functions

seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10
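
Function arguments can be matched by position, as above, or by name. The following sketch calls seq() with its documented by and length.out arguments; the variable names are illustrative.

```r
# Positional arguments: from = 1, to = 10.
one_to_ten <- seq(1, 10)

# Named arguments make the intent explicit: step by 3 from 1 to 10.
by_three <- seq(from = 1, to = 10, by = 3)

# Ask for a fixed number of evenly spaced values instead of a step size.
four_values <- seq(from = 1, to = 10, length.out = 4)

by_three
four_values
```

Both of the last two calls produce the values 1, 4, 7, 10, reached in two different ways.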

5 Data transformation

library(nycflights13)
library(tidyverse)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

int stands for integers.

dbl stands for doubles, or real numbers.

chr stands for character vectors, or strings.

dttm stands for date-times (a date + a time).

lgl stands for logical, vectors that contain only TRUE or FALSE.

fct stands for factors, which R uses to represent categorical variables with fixed possible values.

date stands for dates.
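
To connect these abbreviations to the underlying R types, here is a short base R sketch that builds one value of each kind and inspects it with typeof() or class(). The example values are made up for illustration.

```r
# One example value per column type abbreviation.
typeof(1L)                                   # int:  "integer"
typeof(1.5)                                  # dbl:  "double"
typeof("UA")                                 # chr:  "character"
typeof(TRUE)                                 # lgl:  "logical"
class(factor("EWR"))                         # fct:  "factor"
class(as.Date("2013-01-01"))                 # date: "Date"

# dttm columns are stored as POSIXct date-times.
dttm_value <- as.POSIXct("2013-01-01 05:00:00", tz = "UTC")
class(dttm_value)                            # "POSIXct" "POSIXt"
```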

Filter rows with filter():

filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
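
filter() returns a new data frame and leaves its input unchanged, so the result has to be assigned if it is needed later. A minimal sketch, assuming the nycflights13 package is installed; the name jan1 is an illustrative choice.

```r
library(dplyr)
library(nycflights13)

# Keep only flights that departed on January 1 and save the result.
jan1 <- filter(flights, month == 1, day == 1)

# The saved tibble has the 842 rows shown in the output above.
nrow(jan1)

# The original flights table is not modified.
nrow(flights)
```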

Comparisons:

1 == 1
[1] TRUE
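
Exact comparison with == can give surprising results for doubles, because computers store them with finite precision. The sketch below shows one such case and uses near() from dplyr as a tolerant alternative; the variable names are illustrative.

```r
library(dplyr)

# Mathematically this is 2, but floating-point rounding gets in the way.
exact_match <- sqrt(2)^2 == 2

# near() compares within a small tolerance instead of exactly.
approx_match <- near(sqrt(2)^2, 2)

exact_match
approx_match
```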

Logical or (|):

filter(flights, month == 11 | month == 12)
# A tibble: 55,403 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    11     1        5       2359       6     352     345       7 B6     
 2  2013    11     1       35       2250     105     123    2356      87 B6     
 3  2013    11     1      455        500      -5     641     651     -10 US     
 4  2013    11     1      539        545      -6     856     827      29 UA     
 5  2013    11     1      542        545      -3     831     855     -24 AA     
 6  2013    11     1      549        600     -11     912     923     -11 UA     
 7  2013    11     1      550        600     -10     705     659       6 US     
 8  2013    11     1      554        600      -6     659     701      -2 US     
 9  2013    11     1      554        600      -6     826     827      -1 DL     
10  2013    11     1      554        600      -6     749     751      -2 DL     
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Membership with %in%:

filter(flights, month %in% c(11, 12))
# A tibble: 55,403 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    11     1        5       2359       6     352     345       7 B6     
 2  2013    11     1       35       2250     105     123    2356      87 B6     
 3  2013    11     1      455        500      -5     641     651     -10 US     
 4  2013    11     1      539        545      -6     856     827      29 UA     
 5  2013    11     1      542        545      -3     831     855     -24 AA     
 6  2013    11     1      549        600     -11     912     923     -11 UA     
 7  2013    11     1      550        600     -10     705     659       6 US     
 8  2013    11     1      554        600      -6     659     701      -2 US     
 9  2013    11     1      554        600      -6     826     827      -1 DL     
10  2013    11     1      554        600      -6     749     751      -2 DL     
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
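
Both forms select the same 55,403 rows. The equivalence can be checked on a small made-up vector with base R alone; the vector months and the result names are illustrative.

```r
# A small stand-in for the month column.
months <- c(1, 6, 11, 12)

# Two ways to ask "is this month November or December?"
with_or <- months == 11 | months == 12
with_in <- months %in% c(11, 12)

with_or
with_in

# The two logical vectors are identical.
identical(with_or, with_in)
```

The %in% form scales better as the list of accepted values grows, since each extra value is one more element rather than one more comparison.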

Not in (negating %in% with !):

filter(flights, !(month %in% c(11, 12)))
# A tibble: 281,373 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 281,363 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
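
The negated filter works with or without parentheses because %in% binds more tightly than !. By De Morgan's law it can also be written as two != tests joined by &. A base R sketch on a made-up vector, with illustrative names:

```r
# A small stand-in for the month column.
months <- c(1, 6, 11, 12)

# %in% is evaluated before !, so these two are the same.
no_parens   <- !months %in% c(11, 12)
with_parens <- !(months %in% c(11, 12))

# De Morgan's law: not (A or B) is the same as (not A) and (not B).
de_morgan <- months != 11 & months != 12

no_parens
identical(no_parens, with_parens)
identical(no_parens, de_morgan)
```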

Missing values
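
As a hedged sketch of how missing values interact with filtering: almost any comparison involving NA yields NA, and filter() keeps only rows where the condition is TRUE, so NA rows are dropped unless they are requested with is.na(). The tibble df below is made-up illustration data.

```r
library(dplyr)

# Comparisons with an unknown value are themselves unknown.
NA > 5
NA == NA

# A tiny made-up table with one missing value.
df <- tibble(x = c(1, NA, 3))

# Rows where the condition is FALSE or NA are excluded.
gt_one <- filter(df, x > 1)

# Ask for missing values explicitly to keep them.
gt_one_or_na <- filter(df, is.na(x) | x > 1)

gt_one
gt_one_or_na
```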

6 Workflow: scripts

7 Exploratory Data Analysis

8 Workflow: projects

9 Introduction

10 Tibbles

11 Data import

12 Tidy data

13 Relational data

14 Strings

15 Factors

16 Dates and times

Reference

[Book] ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham. https://ggplot2-book.org/index.html